[ET-VK] Quantized Int8 Convolution + Linear #13811
Closed
Stack from ghstack (oldest at bottom):
Title says it all!
This PR adds implementations for int8 quantized convolution and linear layers. Convolution is implemented as matrix multiplication under the hood by using the im2col procedure.
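As a rough illustration of the im2col idea mentioned above (not the shader code from this PR), the sketch below unrolls every receptive field into a row, so the convolution becomes one matrix multiplication against the flattened filters. The function names, the plain nested-list layout, stride 1, and the absence of padding are all assumptions made for brevity.

```python
def im2col(x, kh, kw):
    """Unroll input patches into rows.

    x is [C][H][W] as nested lists. Assumes stride 1 and no padding.
    Returns (rows, out_h, out_w), where each row has C*kh*kw entries.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = H - kh + 1, W - kw + 1
    rows = []
    for oy in range(out_h):
        for ox in range(out_w):
            rows.append([x[c][oy + ky][ox + kx]
                         for c in range(C)
                         for ky in range(kh)
                         for kx in range(kw)])
    return rows, out_h, out_w


def conv2d_via_im2col(x, weight):
    """weight is [OC][C][kh][kw]; returns [OC][out_h][out_w]."""
    kh, kw = len(weight[0][0]), len(weight[0][0][0])
    cols, out_h, out_w = im2col(x, kh, kw)
    # Flatten each filter in the same (c, ky, kx) order used by im2col.
    w_rows = [[v for ch in f for row in ch for v in row] for f in weight]
    out = []
    for w_row in w_rows:
        # One matmul row: dot product of every patch with this filter.
        flat = [sum(a * b for a, b in zip(patch, w_row)) for patch in cols]
        out.append([flat[r * out_w:(r + 1) * out_w] for r in range(out_h)])
    return out


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3x3
w = [[[[1, 0], [0, 1]]]]                  # 1 filter, 2x2
print(conv2d_via_im2col(x, w))            # [[[6, 8], [12, 14]]]
```

The trade-off is extra memory for the unrolled patches in exchange for reusing a single matrix multiplication routine for both convolution and linear layers.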
For both linear and convolution, two versions are implemented:
- `q8ta_q8csw` variant, which quantizes the input tensor and then performs integer accumulation via the int8 dot product extension
- `q8csw` variant, which dequantizes the weight tensor in-shader and performs floating point accumulation

The second one is needed to provide an alternative path for executing quantized models if the target GPU does not support the int8 dot product extension.
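The difference between the two variants above can be sketched for a linear layer as follows. This is a CPU-side illustration, not the Vulkan shaders from this PR. It assumes per-tensor affine quantization for the activations and symmetric per-output-channel scales for the weights; that quantization scheme, the function names, and the parameter names are hypothetical and not taken from the PR.

```python
def quantize_per_tensor(x, scale, zero_point):
    """Quantize floats to int8 with one scale/zero-point (assumed scheme)."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in x]


def linear_q8ta_q8csw(x, w_q, w_scales, x_scale, x_zp):
    """Quantize the input, accumulate in integers, rescale once at the end."""
    x_q = quantize_per_tensor(x, x_scale, x_zp)
    out = []
    for row, w_scale in zip(w_q, w_scales):
        # Integer accumulation; on a GPU this is the part an int8
        # dot product instruction would handle.
        acc = sum((xq - x_zp) * wq for xq, wq in zip(x_q, row))
        out.append(acc * x_scale * w_scale)
    return out


def linear_q8csw(x, w_q, w_scales):
    """Keep the input in float, dequantize each weight, accumulate in float."""
    out = []
    for row, w_scale in zip(w_q, w_scales):
        out.append(sum(xv * (wq * w_scale) for xv, wq in zip(x, row)))
    return out


x = [0.5, -1.0, 0.25]
w_q = [[1, 2, 3], [-4, 5, 6]]      # int8 weights, one row per output channel
w_scales = [0.5, 0.25]             # one scale per output channel
print(linear_q8ta_q8csw(x, w_q, w_scales, x_scale=0.25, x_zp=0))
print(linear_q8csw(x, w_q, w_scales))
```

With these inputs every value is exactly representable after quantization, so both paths agree. In general the first path adds input quantization error, while the second avoids it at the cost of floating point arithmetic in the inner loop.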
These new ops are tested via the custom op testing + benchmarking framework introduced in the previous diff.
Differential Revision: D81323424